ABSTRACT

The advent of web services that use XML-based message
exchanges has spurred many efforts that address issues related
to inter-enterprise service electronic commerce interactions.
Currently emerging standards and technologies enable
enterprises to describe and advertise their own Web Services
and to discover and determine how to interact with services
fronted by other businesses. However, these technologies
do not address the problem of how to reconcile structural
differences between similar types of documents supported
by different enterprises. Transformations between such
documents must thus be created manually on a case-by-case
basis. In this paper, we explore the problem of how to
automate the transformation of XML e-business documents.
We develop an integrated solution that automates, as much
as possible, all steps of the document transformation
process. First, we propose a set of schema transformation
operations that establish semantic relationships between two
XML document schemas. Second, we define a model that
allows us to compare the cost of performing these operations.
Third, we introduce an algorithm that discovers an efficient
sequence of operations for transforming a source document
schema into a target document schema based on our cost
model. The operation sequence is then used to generate an
equivalent XSLT transformation script. Experimental
results indicate that our algorithm can satisfactorily discover
acceptable transformations.

1. INTRODUCTION
1.1 Motivation 

Web services [9] are significantly more loosely coupled than
traditional applications. Web services are deployed on
behalf of diverse enterprises, and the programmers who
implement them are unlikely to collaborate with each other
during development. However, the purpose of web services
is to enable business-to-business interactions. Web services
should be able to discover new services and interact with
them dynamically without requiring developers to update
the code of either service. This need is spurring the
creation of electronic commerce standards such as ebXML [16]
and the Web Services Conversation Language (WSCL) [1]
that support different aspects of inter-enterprise service
interactions.

However, these technologies do not address the significant
problem of how to reconcile structural differences between
similar types of documents supported by two different
enterprises. For example, let Service A and Service B be services
that front different companies. Suppose that these services
want to engage in a shopping cart interaction, and that
Service B requires Service A to submit a shipping information
document. Service A might be able to provide this
information, but in a slightly different format than Service B
expects. For example, Service A's document might list the
address first and Service B's might list it last. Similarly,
Service B might call the zip code element "Postal Code"
where Service A names it "Zip Code".

Currently, the only recourse of service developers is to
create a transformation between the two documents by hand.
Manual translation of XML documents is time consuming
and thus especially unacceptable for web services, where
the information sources change frequently and applications
must therefore evolve quickly. Clearly, tools are needed to
automate, or at least support, this process as much as
possible.

1.2 State of the Art 

Schema Translation. ARTEMIS [2, 3] supports the analysis
and reconciliation of sets of heterogeneous relational
schemas by measuring the similarity of element names, data
types, and structures. Clio [11, 21] uses reasoning about
SQL queries to create initial mappings between relational
schemas, then refines these mappings using data examples.
However, because relational schemas are flat, neither Clio
nor ARTEMIS can handle hierarchical XML schemas.

TranScm [12] uses schema matching to derive an automatic
translation between schema instances. All input schemas are
transformed into a common model, namely, labeled graphs.
It offers a set of "rules" that describe how to match a
component (i.e., a node in the labeled graph) in the source schema
with a corresponding component in the target schema. The
matching is performed node by node starting at the top.
Rules are checked in a fixed order based on their priorities.
However, since TranScm is a system aiming to provide a
general approach, issues such as which special XML matching
rules should be added to the rule base and how to assign
a priority to each rule would first need to be solved to put
the approach in the XML context.

The machine-learning approach taken in [6] trains a learner
on a set of user-provided mappings from a data source to
the global schema and then discovers the characteristic
instance patterns. Given a new data source, one-to-one
mappings between the leaf nodes of two schema trees can be
established by trying out those learned matches.
However, it does not match source-schema elements with a
hierarchical structure, i.e., the inner nodes in the schema
tree, as needed for XML. Furthermore, in cases where
example data sets of both source and target XML documents
are not available, such an approach cannot be applied.

Tree Matching. Much work has been done in the area of
tree matching. [14] and [22] address the change detection
problem for ordered and unordered trees respectively.
However, the tree matching problem treats the label of each node
as a second-class citizen. For example, the cost of relabeling
is assumed to be cheaper than that of deleting a node with
the old label and inserting a node with the new label.
However, if we model an XML schema as a tree, some labels of
the nodes are names of XML tags, which carry semantic
meaning. Relabeling a node to a semantically unrelated
node will cause an undesirable result. Thus the
assumption is invalid for the XML domain. We overcome
this limitation in our work.

LaDiff [5] adopts a simple cost model in which insert, delete
and move are all unit cost operations, i.e., the cost is 1. We
refine this cost model to take XML characteristics into
account. LaDiff also assumes that each node of the input
trees has a special label that describes its semantics (semantic
tag). For example, a tree representing a document may
have tags "paragraph", "section", etc. It further assumes that
for each leaf node in the source tree, there is at most one leaf
node in the target tree that is "close" to it (unique close partner).
These assumptions facilitate the matching. MH-Diff [4]
allows flexible cost models and drops the assumptions in [5]
but then takes quadratic time in the size of the input.

There are a number of differences between the tree matching
problem studied in [5, 4] and the specific problem of
matching trees that model XML schemas. First, some of
their edit operations, such as copy and glue in [4], are not
meaningful for an XML schema. Instead we need
XML-schema-specific edit operations. Second, the assumption of
a unique close partner in [5] does not necessarily hold true.
Meanwhile, the assumption of semantic labels holds only for some
of the nodes in the XML model. That is, some of the nodes
do not have tags describing their semantics (e.g., constraint
nodes, which will be introduced in Section 2) while others
do have them (e.g., tag nodes). Hence it is not possible to
use only the assumptions to direct the mapping. Neither is
it suitable to completely discard the assumptions as [4] has
done, which results in a high time complexity.

XML Document Restructuring. [7] studies how to change
an XML document in terms of both schema and data. It
proposes a set of DTD change primitives that can be applied to
an old DTD to derive a new DTD; the corresponding data
changes are made implicitly as well. These primitives,
however, do not cover all the XML schema mappings that
are most likely to happen.

1.3 The Xtra Approach 

Since DTDs are currently the dominant industry standard,
we address the problem of how to transform a document
conforming to a source DTD so that it will conform to a
target DTD. Our approach could easily be adapted to XML
Schema [18]. Given a source and a target DTD, we first
model each DTD as a tree. This allows us to express the
problem as how to transform one DTD tree into another. To
this end, we have defined a set of DTD transformation
operations that establish the semantic relationships between two
trees. We also define a cost model for choosing a sequence of
transformation operations among multiple alternatives. We
have developed an algorithm to discover a sequence of
operations (i.e., a transformation script) that transforms a source
DTD tree into a target DTD tree. The discovery process is
guided by provided auxiliary information (e.g., a synonym
dictionary, a domain ontology, etc.) and by this cost model.
Lastly, we use the resulting transformation script to
generate an Extensible Stylesheet Language Transformations
(XSLT) script [19]. The XSLT script can then be applied to
source XML documents to transform them into XML
documents conforming to the target DTD. Figure 1 shows the
architecture of our system.




Figure 1: Xtra System Architecture 



The primary contributions of our work include: 

1. We propose a set of DTD transformation operations 
that capture common discrepancies between alterna- 
tive DTD design behaviors for modeling real-world 
data. 

2. We define a cost model (based on the concept of data 
capacity) for measuring the quality of DTD transfor- 
mations. 

3. We have implemented an XML TRAnslation prototype
system (Xtra) 1 , and run experiments on both real
and synthetic XML documents to verify the feasibility
of our approach.



1 Xtra has been demonstrated at ACM SIGMOD 2001.
2. DTD DATA MODEL

Document Type Definition (DTD) [17] describes the structure
of XML documents as a list of element type declarations.
Elements can in turn have content particles or attributes, or
be empty. The structure of an element is defined via a
content model built out of operators applied to its content
particles. Content particles can be grouped as sequences
(e.g., a,b) or as choices (e.g., a|b), with both a and b
being content particles again. For every content particle, the
content model can specify its occurrence in its parent
content particle using regular expression operators (i.e., ?, *, +).
Attributes can be of various types such as ID, CDATA, etc.
They can be optional (#IMPLIED) or mandatory
(#REQUIRED). Optionally, attributes can have a default or a
constant value (#FIXED). We model an element type
declaration as a tree, denoted as T = (N, p, l), where N is the
set of nodes, p is the parent function representing the parent
relationship between two nodes, and l is the labeling
function representing the properties of a node. We categorize a
node n ∈ N based on its label l(n).

• Tag node:

- Element node: Each element node n is associated
with an element type T. l(n) is a singleton in
the format of [Name], where Name is T's name.

- Attribute node: Each attribute node n is associated
with an attribute type T. l(n) is a quadruple
in the format of [Name, Type, Def, Val], where
Name is T's name, Type is T's data type (e.g.,
CDATA, etc.), Def is T's default property (e.g.,
#REQUIRED, #IMPLIED, etc.), and Val is T's
default or fixed value, if any.

• Constraint node:

- List node: Each list node n indicates how its
children are composed, that is, by sequence (i.e.,
l(n) = [","]) or by choice (i.e., l(n) = ["|"]).

- Quantifier node: Each quantifier node n represents
whether its children occur in its parent's content
model one or more times (i.e., l(n) = ["+"], called plus
quantifier node), zero or more times (i.e., l(n) = ["*"],
called star quantifier node), or zero or one time (i.e.,
l(n) = ["?"], called qmark quantifier node).
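The node categories above map naturally onto a small tree data structure. The following Python sketch is only an illustration of the model T = (N, p, l); the class and helper names (Node, element, attribute, list_node, quantifier) and the sample content model are assumptions made for this example and are not taken from the Xtra prototype.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)  # identity-based equality avoids recursing through parent links
class Node:
    kind: str                      # "element", "attribute", "list" or "quantifier"
    label: Tuple                   # the label l(n)
    parent: Optional["Node"] = field(default=None, repr=False)  # the parent p(n)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach child under this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child

    @property
    def is_tag_node(self) -> bool:
        # Tag nodes are element or attribute nodes; the rest are constraint nodes.
        return self.kind in ("element", "attribute")


def element(name: str) -> Node:
    return Node("element", (name,))


def attribute(name: str, type_: str, default: str, value: Optional[str] = None) -> Node:
    return Node("attribute", (name, type_, default, value))


def list_node(op: str) -> Node:
    if op not in (",", "|"):
        raise ValueError("list nodes are labeled ',' (sequence) or '|' (choice)")
    return Node("list", (op,))


def quantifier(op: str) -> Node:
    if op not in ("+", "*", "?"):
        raise ValueError("quantifier nodes are labeled '+', '*' or '?'")
    return Node("quantifier", (op,))


# Hypothetical declarations used only for illustration:
#   <!ELEMENT company (cname, (street, city), license?)>
#   <!ATTLIST company id ID #REQUIRED>
company = element("company")
company.add(element("cname"))
group = company.add(list_node(","))          # the nested (street, city) group
group.add(element("street"))
group.add(element("city"))
company.add(quantifier("?")).add(element("license"))
company.add(attribute("id", "ID", "#REQUIRED"))
```

Storing the parent pointer alongside the child list keeps both p(n) and the subtree rooted at n directly accessible, which is what the transformation operations in Section 3 need.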

A tree rooted at a node of element type T is called T's type
declaration tree. We assume each DTD has a unique root
element type whose type declaration tree is then the DTD
tree. For example, Figure 2 shows two sample DTDs of
web-service purchase orders. These two DTDs will be used
as the running example throughout the paper. The DTDs are
modeled as DTD trees in Figures 3 and 4. For simplicity,
we mark each node with its name rather than a complete
label. In the following, we represent each node by its name
n with a subscript i indicating the number of the DTD it
is within (i.e., 1 or 2). The format is then <n>i 2 . Since
each element type declaration is composed of a list of
content particles enclosed in parentheses (optionally followed
by a quantifier), we do not explicitly model the outermost
parenthesis construct as a sequence list node in the DTD
trees. For example, <name>2 has two children, <first>2
and <last>2, rather than a sequence list node.

2 If there is a duplicate name, another subscript j specifying
which occurrence it is can be added, i.e., <n>i,j.



<!ELEMENT company (address, cname, personnel)>
<!ATTLIST company license CDATA #IMPLIED>
<!ELEMENT address (street, city, state, zip)>
<!ELEMENT cname (#PCDATA)>
<!ELEMENT personnel (person)+>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip (#PCDATA)>
<!ELEMENT person (name, email?, url*, fax+)>
<!ELEMENT name (family|given|middle?)*>
<!ELEMENT email (#PCDATA)>
<!ELEMENT url (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT middle (#PCDATA)>

(a)

<!ELEMENT company (cname, (street, city, state, postal),
                   personnel, license?)>
<!ELEMENT cname (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT postal (#PCDATA)>
<!ELEMENT personnel (person)+>
<!ELEMENT license (#PCDATA)>
<!ELEMENT person (name, email*, url?, fax, fax?, phonenum)>
<!ELEMENT name (first, last)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT url (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT phonenum (#PCDATA)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>

(b)

Figure 2: Example DTDs of Web Service A and B's
Purchase Orders

3. TRANSFORMATION OPERATIONS 
3.1 Taxonomy of Transformation Operations 

We identify two primary causes of discrepancies between the
DTD components modeling the same concepts. First, the
properties of the concepts may differ. For example, the phone
number is required in the contact information in DTD 2 while
it is not required in DTD 1. Second, due to the relatively
free-form nature of XML and the lack of a standard for DTD
design, a given concept can be modeled in a variety of ways.
For example, an atomic property can be represented as
either a #PCDATA sub-element or an attribute. We have
studied the common DTD design patterns and correspondingly
propose a set of transformation operations, as listed
below 3 .

1. add(T, n): Add a subtree T under node n, i.e., add a
new content particle to element n's content model.

2. insert(n, p, C): Insert a new node n under node p, with
n a quantifier node or a sequence list node, and move a
subset C of p's children to become n's children. If n is a
quantifier node, the operation changes the occurrence
property of the children C in p's content model from
"exactly once" to the occurrence that n denotes. If n is a
sequence list node, it groups the nodes C.

n cannot be an attribute node since an attribute node
would not have any children. We also do not allow n to
be an element node because it may cause undesirable
matches. See Example 1.

3 We use "child" to refer to direct child versus "descendants". 
Figure 3: DTD 1's DTD Tree




Figure 4: DTD 2's DTD Tree 

3. delete(T): Delete subtree T, i.e., delete a content
particle T from a content model. This is the reverse
operation of add.

4. remove(n): Remove node n, with n a quantifier or a
sequence list node. All n's children now become p(n)'s
children. This is the reverse operation of insert.

5. relabel(n, l, l'): Change node n's original label l to l'.

• relabel within the same type (the operation does
not change the node's type):

(a) Renaming between two element nodes, two
attribute nodes or two quantifier nodes, but not
between a sequence list node and a choice list
node; (b) Conversion between an attribute's
default types #REQUIRED and #IMPLIED.

• relabel across different types (the operation changes
the node's type):

(a) Conversion between a sequence list node and
an element node which has children. This
corresponds to using a group instead of an element
type, or encapsulating the group into a new
element type. See Example 2. (b) Conversion
between an attribute node with type CDATA,
default type #REQUIRED and no default or fixed
value, and a #PCDATA element node; (c) Conversion
between an attribute node of default property
#IMPLIED and a #PCDATA element node with
a qmark quantifier parent node. (b) and (c) are
allowed because people can model a one-to-one
relationship property as either an attribute or a
subelement. See Example 3.

6. unfold(T, <T1, T2, ..., Ti>): Replace subtree T with a
sequence of subtrees T1, T2, ..., Ti. T must be rooted at a
repeatable quantifier node. T1, T2, ..., Ti must satisfy
that: (1) they are adjacent siblings; and (2) they, or
their subtrees without a qmark quantifier root node,
are isomorphic. unfold recasts a repeatable content
particle as a sequence of non-repeatable content
particles. See Example 4.

7. fold(<T1, T2, ..., Ti>, T): This is the reverse operation
of unfold.

8. split(m1, <n1, n2>): A sequence list node m1 is split
into a star quantifier node n1 and a choice list node n2.
Because there is no DTD operator to create unordered
sequences, tuples <a, b> tend to be expressed using
the construct (a|b)* rather than (a, b)|(b, a). This
operation corresponds to converting an ordered sequence
to an unordered one. See Example 5.

9. merge(<n1, n2>, m1): n1 and n2 are merged into a
single node m1, with n1 a star quantifier node, n2 a
choice list node and m1 a sequence list node. This is
the reverse operation of split.
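To make the tree-level effect of the structural operations concrete, the following Python sketch implements insert and its reverse, remove, on a minimal node class. It is an illustrative sketch only: the Node class, the shape helper, and the choice to place the new node at the position of the first moved child are assumptions of this example, not details given in the paper.

```python
class Node:
    """Minimal tree node: a label, a parent pointer and an ordered child list."""

    def __init__(self, label, children=()):
        self.label = label
        self.parent = None
        self.children = []
        for child in children:
            self.attach(child)

    def attach(self, child, index=None):
        child.parent = self
        if index is None:
            self.children.append(child)
        else:
            self.children.insert(index, child)


def insert(n, p, C):
    """insert(n, p, C): place constraint node n under p and move p's children C under n."""
    if n.label not in (",", "+", "*", "?"):
        raise ValueError("n must be a quantifier node or a sequence list node")
    if not C or any(c.parent is not p for c in C):
        raise ValueError("C must be a non-empty subset of p's children")
    index = min(p.children.index(c) for c in C)  # assumed position for n
    for c in C:
        p.children.remove(c)
        n.attach(c)
    p.attach(n, index)


def remove(n):
    """remove(n): delete constraint node n; its children become p(n)'s children."""
    p = n.parent
    index = p.children.index(n)
    p.children.remove(n)
    for offset, c in enumerate(list(n.children)):
        p.attach(c, index + offset)
    n.children = []
    n.parent = None


def shape(node):
    """Nested (label, children) view of a subtree, convenient for comparisons."""
    return (node.label, [shape(c) for c in node.children])


# personnel (person)  ->  personnel (person)+  ->  personnel (person)
person = Node("person")
personnel = Node("personnel", [person])
plus = Node("+")
insert(plus, personnel, [person])
after_insert = shape(personnel)
remove(plus)
after_remove = shape(personnel)
```

Here insert changes the occurrence of person from "exactly once" to "one or more", and remove restores the original content model, mirroring the statement that the two operations are reverses of each other.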

Example 1. For the DTDs shown below, it is inappropriate
to derive <name>2 from <name>1 by inserting a
tag node CEO between company and name, since <name>2
indicates the company's CEO's name while <name>1
indicates the company's name. We would rather first delete
name and then add a subtree rooted at CEO to derive
DTD 2.

<!ELEMENT company (name, address, webpage)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT webpage (#PCDATA)>
DTD 1

<!ELEMENT company (CEO, webpage)>
<!ELEMENT CEO (name, address)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT webpage (#PCDATA)>
DTD 2

Example 2. <!ELEMENT company (street, city, state,
zip)> can be transformed from <!ELEMENT company
(address)> <!ELEMENT address (street, city, state, zip)>.

Example 3. <!ELEMENT company (license?)>
<!ELEMENT license (#PCDATA)> can be transformed
from <!ELEMENT company EMPTY> <!ATTLIST
company license CDATA #IMPLIED> by a relabel
operation.

Example 4. <!ELEMENT person (fax+)> can be
unfolded to <!ELEMENT person (fax, fax)> or
<!ELEMENT person (fax, fax?)>.

Example 5. <!ELEMENT name (first, last)> is split
into <!ELEMENT name (first|last)*>.
3.2 Constraints on the Transformation

While our atomic operations reflect intuitive transforma- 
tions, some combinations of operations may result in non- 
intuitive transformations. For example, for the DTDs shown 
in Example 1 in Section 3.1, we can derive DTD 2 from DTD 
1 by first inserting a sequence list node above name and 
address, and then relabeling the sequence list node to tag 
node CEO. This is equivalent to the forbidden operation of 
inserting a tag node CEO above name and address. 

Common design patterns show that an element type
declaration will not be deeply nested. A survey of real world
DTDs [13] analyzes 65 DTDs available at [20]. The depth of
content models is defined as: 0 for EMPTY; 1 for a single
element, a sequence or a choice; n for an alternation of
sequences and choices of depth n. For example, (a, (b|(c, d)))
has depth 3. It turns out that the maximum depth of an
element type's content model is mostly around 2 or 3, with
the average depth being even lower. This is because complex
regular expressions are not advisable since they are difficult
to understand. Also, a complex expression can usually
be rewritten as several simpler ones. According to this
design pattern, if a node n1 has a matching partner n2, it is
highly likely that n1 and n2 have a similar depth in the
subtrees rooted at their nearest matching ancestors in the DTD
trees. This gives a hint that if the DTDs are designed in a
common manner, the search space can be pruned to gain
time efficiency. Therefore we discover only change scripts
that do not violate the constraint that a node can only be
operated on by a non-relabel operation that is optionally
preceded or followed by a relabel operation 4 .

4. COST MODEL OF OPERATIONS 

These operations can be combined into a variety of
equivalent transformation scripts. In order to facilitate selection
among alternative transformations, we propose a cost
model that evaluates the cost of transformation operations
in terms of their impact on the data capacity of the
document schemas. Relative information capacity [8] measures
the semantic connection between database schemas. That
is, two schemas are considered equivalent if and only if there
is a one-to-one mapping between all data instances of the
source and the target schema. We assume that the DTDs
in our study are flat [10], i.e., no schema information such
as the names of element or attribute types in one DTD is
stored as PCDATA or as attribute values in an XML
document conforming to another DTD. Hence we only consider
PCDATA and attribute values in XML documents as data.
The data capacity of an XML document denotes the
collection of all of its data.

4.1 Factors of the Cost Model 

Data capacity gap. We say a transformation operation is
data capacity reducing (DC-Reduce) if it must result in the
loss of data, e.g., delete a subtree. Correspondingly, we have
data capacity increasing (DC-Increase) operations, e.g., add
a subtree, and data capacity preserving (DC-Preserve)
operations, e.g., relabel an element node to a sequence list node.
However, for some operations, it is difficult to determine from
the DTDs alone whether the transformation will result in the
loss, addition, or preservation of data capacity. For example,
the operation remove quantifier node <"*"> changes
the content particle from non-required to required, which may
cause an increase in data. It also changes the content particle
from repeatable to non-repeatable, which may cause data
reduction. Hence reducing, increasing or preserving of data
capacity are all possible and depend on the individual source
XML document. We call these transformations data capacity
ambiguous (DC-Ambiguous). We use DC(op) to denote
the cost that the data capacity gap of the operation op
contributes to op's overall cost. For any two ops falling into the
same category, DC(op) is the same, and 0 ≤ DC(op) ≤ 1.

4 Relabel is the only operation that does not change the
tree's topology.

Potential data capacity gap. Although some transformations
are data capacity preserving, there may still be a
potential data capacity gap between a document conforming
to the source DTD and one conforming to the target DTD.
For example, the operation insert a quantifier node <"+">
is a DC-Preserve transformation. However, it changes the
children content particles' occurrence property from countable
to non-countable and thus allows the XML documents
to accommodate more data in the future. We use PDC(op)
to denote the cost that the potential data capacity gap
contributes to operation op's overall cost. We define

PDC(op) = w_required * required_changed(op)
        + w_countable * countable_changed(op),

where required_changed(op) and countable_changed(op) are
two boolean functions that indicate whether the properties
"required" or "countable" of the content particles that are
operated on by op are changed or not. The weights w_required
and w_countable indicate the importance of the change of the
corresponding property for the potential data capacity. They
satisfy w_required + w_countable = 1.
Only operations that insert, remove or relabel a quantifier
node may have a PDC cost that is not 0. For example,
suppose w_required = w_countable = 0.5. Then for op of
insert a quantifier node <"+">, required_changed(op) = 0 and
countable_changed(op) = 1, therefore PDC(op) = 0.5 * 0 +
0.5 * 1 = 0.5.
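The PDC computation can be sketched in a few lines of Python. This is a hedged illustration, not the Xtra implementation: the OCCURRENCE table and the function name pdc are assumptions of this example, and in particular treating a qmark quantifier as "countable" (it bounds the occurrence to at most one) is our reading rather than something the text states.

```python
# (required, countable) occurrence properties per quantifier label.
# None stands for "exactly once" (no quantifier node).
# Assumption of this sketch: "?" counts as countable, "+" and "*" do not.
OCCURRENCE = {
    None: (True, True),
    "?": (False, True),
    "+": (True, False),
    "*": (False, False),
}


def pdc(old_quantifier, new_quantifier, w_required=0.5, w_countable=0.5):
    """Potential data capacity cost of changing a particle's quantifier."""
    if abs(w_required + w_countable - 1.0) > 1e-9:
        raise ValueError("w_required and w_countable must sum to 1")
    old_required, old_countable = OCCURRENCE[old_quantifier]
    new_required, new_countable = OCCURRENCE[new_quantifier]
    required_changed = int(old_required != new_required)
    countable_changed = int(old_countable != new_countable)
    return w_required * required_changed + w_countable * countable_changed


insert_plus = pdc(None, "+")   # insert a quantifier node <"+">
remove_star = pdc("*", None)   # remove a quantifier node <"*">
```

With equal weights, inserting a plus quantifier changes only the "countable" property, while removing a star quantifier changes both properties.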

Relative factors of operands. The number, size or property
of the operands involved in an operation may impact the
data capacity or the potential data capacity gap. We use
Fac(op) to denote this factor. For example, for a content
model (fax+), a fold operation deriving it from (fax, fax)
will be more expensive than one deriving it from (fax,
fax, fax). This is because the former causes a greater
potential data gap. As another example, when relabeling
occurs between two tag nodes, if their names are synonyms
(e.g., "zip" and "postal"), Fac(op) is 0. If no knowledge
about the relationship between the two names
is available, then Fac(op) grows as the similarity of
their name strings decreases. For example, Fac(op) of relabeling
between "address" and "addr" will be less than that between
"address" and "capital".

We then have Cost(op) = (DC(op) + PDC(op)) * Fac(op).
The user of the Xtra system can customize the cost model by
tuning DC(op), PDC(op) and Fac(op). We also provide
a default setting. Intuitively, information loss is not desirable
in that old information cannot be reconstructed from the
new information. Hence the more information loss an
operation causes, the more expensive the operation is. Therefore
we rank the cost of each data capacity gap category from
lower to higher in the order of DC-Preserve, DC-Increase,
DC-Ambiguous and DC-Reduce. However, our algorithm for
discovering the transformation script does not depend on this
particular relationship.
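As an illustration of how Fac(op) for a relabel between two tag nodes might plug into Cost(op), consider the Python sketch below. The paper does not specify a string metric, so the use of difflib.SequenceMatcher here is a stand-in chosen for this example; the SYNONYMS set, the scale parameter and the function names are likewise hypothetical. Following the example in the text, the factor is 0 for synonyms and otherwise larger for less similar names.

```python
from difflib import SequenceMatcher

# Hypothetical synonym dictionary; in Xtra this would come from auxiliary information.
SYNONYMS = {frozenset(("zip", "postal"))}


def relabel_fac(old_name, new_name, scale=1.0):
    """Fac(op) for relabeling one tag name to another (illustrative only)."""
    if old_name == new_name or frozenset((old_name, new_name)) in SYNONYMS:
        return 0.0
    # ratio() lies in [0, 1]; 1 means identical strings.
    similarity = SequenceMatcher(None, old_name, new_name).ratio()
    return scale * (1.0 - similarity)


def cost(dc, pdc, fac):
    """Cost(op) = (DC(op) + PDC(op)) * Fac(op)."""
    return (dc + pdc) * fac


fac_synonym = relabel_fac("zip", "postal")
fac_close = relabel_fac("address", "addr")
fac_far = relabel_fac("address", "capital")
```

Because Fac(op) multiplies the sum of DC(op) and PDC(op), a synonym relabel is free regardless of the data capacity category it falls into.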

4.2 Examples 

We illustrate how to use the cost model to choose a matching
plan from multiple candidates using the running example.
Assume we have the following settings: DC-Preserve,
0.6; DC-Increase, 0.8; DC-Ambiguous, 0.9; DC-Reduce, 1.0.
Fac(op) for an operation op of relabeling between an
element node and a sequence list node is 3, while that for
relabeling between two synonym element nodes is 0. Also, for
an operation op that adds a subtree, we define Fac(op)
to be proportional to the number i of the subtree's leaf nodes
(i.e., Fac(op) = k * i), and we assume k = 1. Suppose now we
have two options to derive the subtree rooted at <,>2. The
first option is to match <address>1 to <,>2, i.e., relabel
<address>1 from [address] to [,] and relabel <zip>1 from
[zip] to [postal]. The second option is to add a new subtree,
namely the one finally rooted at <,>2. For the first
option, the cost of the first relabeling is (DC(op) + PDC(op)) *
Fac(op) = (0.6 + 0) * 3 = 1.8, while the cost of the second
relabeling is (DC(op) + PDC(op)) * Fac(op) = (0.6 + 0) * 0 = 0.
The total cost is then 1.8 + 0 = 1.8. For the second option,
the cost is (DC(op) + PDC(op)) * Fac(op) = (1.0 + 0) * 4 = 4.
In this case, the first option is preferable due to its lower
cost. However, suppose that <address>1 only has one
single child node <postal>1. The first option now has three
more operations than the original sequence of operations,
i.e., add <street>2, add <city>2 and add <state>2. These
three additional operations cost (0.8 + 0) * 1 = 0.8 each. The
total cost of the first option is therefore 1.8 + 0.8 * 3 = 4.2.
The cost of the second option (i.e., 4) does not change
since the sequence of operations does not change. This time
the first option is not preferable since it is more expensive
than the second option.
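The arithmetic of this example can be replayed with a short Python sketch. The dictionary keys and variable names are assumptions made for illustration; the numeric settings and the DC value of 1.0 used for the second option are taken directly from the computation in the text.

```python
import math

# Settings stated in the example.
DC = {"preserve": 0.6, "increase": 0.8, "ambiguous": 0.9, "reduce": 1.0}


def cost(dc, pdc, fac):
    """Cost(op) = (DC(op) + PDC(op)) * Fac(op)."""
    return (dc + pdc) * fac


# Option 1: relabel <address>1 to the sequence list node, relabel zip to postal.
relabel_address = cost(DC["preserve"], 0, 3)   # element <-> sequence list, Fac = 3
relabel_zip = cost(DC["preserve"], 0, 0)       # synonyms, Fac = 0
option1 = relabel_address + relabel_zip

# Option 2: add the four-leaf subtree; the text computes (1.0 + 0) * 4.
option2 = cost(1.0, 0, 4)

# Variant: <address>1 has a single child, so option 1 needs three extra adds,
# each adding a one-leaf subtree (Fac = k * i = 1 * 1).
add_leaf = cost(DC["increase"], 0, 1)
option1_variant = option1 + 3 * add_leaf

best_original = "option 1" if option1 < option2 else "option 2"
best_variant = "option 1" if option1_variant < option2 else "option 2"
```

Comparisons between floating-point totals should use a tolerance (for instance math.isclose), since 0.6 * 3 is not represented exactly as 1.8.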

5. GENERATION OF DTD TREE MATCHES 
5.1 Simplified Element Trees 

We constrain our investigation to the domain of e-business
documents that are exchanged between services that share
a common ontology. We can thus use name similarity as
the first heuristic indicator of a possible semantic relationship
between two tag nodes. For example, in Figures 3 and
4, the matching document root types both have a child node
named personnel, so without looking at their descendants,
we can match these two nodes. We can derive the matching
between the two personnel nodes' descendants by comparing
the two personnel type declaration trees separately. However,
suppose that in DTD 2, people were used instead of personnel and
no synonym knowledge was given. We would then need to
look further at personnel's and people's descendants to decide
whether to match them.

We therefore introduce the concept of a simplified element
tree. Such a tree is designed to capture the relationship
indicated by names between specific elements of the two
documents. When two DTDs are provided, we say a tag
node is non-rename-able if there exists any tag in the other
DTD whose name is the same or a synonym. In other
words, the Fac(op) cost of an operation op of renaming a
non-rename-able node to another node with a non-synonym
name is infinitely expensive. A simplified element tree of
element type E, denoted as ST, is a subtree of E's type
declaration tree T that is rooted at T's root, with each branch
ending at the first reachable non-rename-able node. In
Figures 3 and 4, the four subtrees within the dashed lines are
the simplified element trees of company, personnel, person and
name in the two DTDs respectively.
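The definition of a simplified element tree can be sketched as a recursive cut of the type declaration tree. The Python code below is an illustration under stated assumptions: trees are encoded as nested (label, children) tuples, constraint nodes are recognized by their labels, and the function and parameter names are hypothetical rather than taken from Xtra.

```python
CONSTRAINT_LABELS = {",", "|", "+", "*", "?"}  # list and quantifier node labels


def simplified_element_tree(tree, other_names, synonyms=frozenset()):
    """Cut each branch below the root at the first non-rename-able tag node.

    tree        -- nested (label, [children]) tuples for a type declaration tree
    other_names -- set of tag names occurring in the other DTD
    synonyms    -- set of frozenset({name_a, name_b}) synonym pairs
    """

    def non_rename_able(name):
        return name in other_names or any(
            frozenset((name, other)) in synonyms for other in other_names
        )

    def cut(node):
        label, children = node
        if label not in CONSTRAINT_LABELS and non_rename_able(label):
            return (label, [])  # branch ends here; descendants are matched separately
        return (label, [cut(child) for child in children])

    root_label, root_children = tree
    return (root_label, [cut(child) for child in root_children])


# Part of DTD 2's company declaration tree from the running example.
company2 = ("company", [
    ("cname", []),
    (",", [("street", []), ("city", []), ("state", []), ("postal", [])]),
    ("personnel", [("+", [("person", [("name", [])])])]),
    ("?", [("license", [])]),
])
dtd1_names = {"company", "address", "cname", "personnel", "street", "city",
              "state", "zip", "person", "name", "license"}
st_company = simplified_element_tree(
    company2, dtd1_names, synonyms={frozenset(("zip", "postal"))}
)

# If DTD 2 used "people" and no synonym were known, the branch would continue
# below it until the first non-rename-able node (person).
people_variant = simplified_element_tree(
    ("company", [("people", [("+", [("person", [("name", [])])])])]), dtd1_names
)
```

In the first call the personnel branch stops at personnel itself, while in the variant the unknown name people forces the simplified tree to extend down to person.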

5.2 The Matching Algorithm 

DMatch (DTD Match) is an XML-structure-specific tree matching
algorithm. The general unordered tree matching problem
is a notoriously high-complexity NP problem [22]. As we
have mentioned in Section 1.2, the typical assumption about
relabeling in general tree matching does not hold in our
scenario. Thus those techniques do not apply to our scenario.
Our DMatch tree matching algorithm incorporates the
domain characteristics of the specific DTD tree transformation
operations, the imposed constraints and the cost model.

Given a source simplified element tree T1 and a target
simplified element tree T2, we call nodes in T1 source nodes and
nodes in T2 target nodes. If n1 and n2 are a source and
a target node respectively, the DMatch(n1, n2) algorithm
discovers a sequence of operations (i.e., the transformation
script) that transforms the subtree rooted at n1 into the
subtree rooted at n2. The total cost of the script is then the cost
of matching (deriving) n1 and n2. Source nodes that
are deleted or removed are not mapped to any
existing target node. To simplify the description, we therefore
specify two special nodes, namely Φ1 and Φ2. A node which
is removed is said to be mapped to Φ1, and a node which
is deleted is said to be mapped to Φ2.

DMatch is composed of two phases. The first phase is
preprocessing. The operations add, insert, delete, remove
and relabel set up a one-to-one mapping relationship.
However, the operations unfold, fold, split and merge set up
a one-to-many relationship. For example, unfold maps one
subtree to multiple subtrees, and split maps two nodes (a
star quantifier and a choice list node) to a sequence list
node. To make the matching discovery process uniform for
each node, we preprocess each of the simplified element
trees.

Figure 5: Preprocessing: Fold

In the preprocessing phase, we first convert all the
representations of a sequence of non-repeatable content
particles into the format of a repeatable content particle,
i.e., perform fold operations. For example, Figure 5 (a)
will be converted to Figure 5 (b). In Figure 5 (b), we mark
the plus quantifier node with a number (i.e., two)
indicating the maximum occurrence of content particle fax.
By marking the nodes resulting from the preprocessing, we
are able to keep track of where they are derived from, so we
do not lose the information needed for computing the overall
cost. Second, we impose a certain order on those
representations that allow arbitrary combination of content
particles, i.e., perform merge operations.
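
The fold step of the preprocessing can be sketched as
follows. This is a minimal illustration under assumptions:
the children of a sequence node are given as a list of
particle names, a trailing "?" marks an optional particle,
and the input is a hypothetical sequence that is not
necessarily the one drawn in Figure 5. The function name and
the tuple layout are likewise invented for the sketch.

```python
from itertools import groupby


def fold_sequence(particles):
    """Fold runs of the same content particle into one '+' node.

    Each result is (name, quantifier, max_occurs). max_occurs keeps
    the length of the folded run, so the origin of the node is still
    known when the overall cost is computed.
    """
    folded = []
    for name, run in groupby(particles, key=lambda p: p.rstrip("?")):
        run = list(run)
        if len(run) == 1:
            folded.append((run[0], None, 1))       # nothing to fold
        else:
            folded.append((name, "+", len(run)))   # marked plus node
    return folded


result = fold_sequence(["phone", "fax", "fax?"])
```

Here the two fax particles collapse into a single plus
quantifier node marked with two, while phone is left
unchanged.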

In the second phase, we then find one-to-one node mappings.
To derive the transformation from the subtree rooted at n1
to the subtree rooted at n2, for each child m1 of n1, we
attempt to find a matching partner m2 (a matching partner
can also be a special node Φ1 or Φ2). This matching
discovery is done in two passes. In pass 1, we visit each
child m1 of n1 sequentially and compare it with a certain
set of target nodes. We call the set of nodes that will be
compared with the current source node the matching candidate
set S. Since we have the constraint that a node cannot be
directly operated on more than once, m1's matching partner
m2 can only be on the same level as m1 (no operation or
relabel operated on m1), one level deeper than m1 (insert
operated on m1), or a special node (delete or remove
operated on m1). By recursively applying DMatch to m1 and
each node s_i (0 < i < |S|) in S, we find a node s_k with
the least matching cost c. We have a control strategy to
decide whether to match m1 with s_k. In pass 1, we apply a
delay-match scheme, which delays matching s_k with m1 if the
cost is not low enough, i.e., c is not less than the cost of
deleting m1. Thus s_k can be preserved to be matched with
another node associated with a satisfactorily low cost.



Source      Matching Candidate Set
----------  -----------------------------------------------
element     element node on the same level
attribute   attribute node on the same level
choice      choice node on the same level
sequence    sequence node on same level or one level deeper
quantifier  quantifier node on same level or one level deeper

Figure 6: Matching Candidate Set in Pass 1

Source      Matching Candidate Set
----------  -----------------------------------------------
element     element node on same level or one level deeper;
            sequence node on same level;
            attribute node on same level
attribute   element node on the same level
choice      choice node on same level or one level deeper
sequence    sequence node on same level or one level deeper;
            quantifier node on same level
quantifier  quantifier node on same level or one level deeper;
            sequence node on same level

Figure 7: Matching Candidate Set in Pass 2



After visiting all children of n1, we begin pass 2 and
traverse all unmatched children of n1 again, comparing them
with possible candidates. Again, we apply DMatch to m1 and
each node s_i in S and find a node s_k with the least
matching cost c. Now a must-match scheme is applied, in
contrast to the delay-match scheme in pass 1: m1 is matched
with s_k if c is less than the cost of deleting m1 and
adding s_k. Figures 6 and 7 show the nodes that will be
selected into the matching candidate set S in pass 1 and
pass 2 respectively.
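
The two acceptance rules can be made concrete with a short
Python sketch. It is an illustration under assumptions, not
the implemented algorithm: the recursive DMatch call is
replaced by a match_cost function supplied by the caller, the
candidate sets are passed in as functions, the node names and
cost values are invented, and the rule that a target node
already matched is no longer a candidate is our own reading of
the text.

```python
def two_pass_match(sources, cand1, cand2, match_cost, delete_cost, add_cost):
    """Return {source: target} using delay-match, then must-match."""
    matched, used = {}, set()

    def best(m, candidates):
        # Cheapest target that is still free, or (None, None).
        free = [s for s in candidates if s not in used]
        if not free:
            return None, None
        s_k = min(free, key=lambda s: match_cost(m, s))
        return s_k, match_cost(m, s_k)

    # Pass 1 (delay-match): accept only if cheaper than deleting m.
    for m in sources:
        s_k, c = best(m, cand1(m))
        if s_k is not None and c < delete_cost(m):
            matched[m] = s_k
            used.add(s_k)

    # Pass 2 (must-match): accept if cheaper than delete m + add s_k.
    for m in sources:
        if m in matched:
            continue
        s_k, c = best(m, cand2(m))
        if s_k is not None and c < delete_cost(m) + add_cost(s_k):
            matched[m] = s_k
            used.add(s_k)
    return matched


# Hypothetical costs: m_a/s_x differ in name, m_b/s_y are identical.
costs = {("m_a", "s_x"): 3, ("m_a", "s_y"): 5,
         ("m_b", "s_x"): 5, ("m_b", "s_y"): 0}
targets = ["s_x", "s_y"]
result = two_pass_match(
    ["m_a", "m_b"],
    cand1=lambda m: targets,
    cand2=lambda m: targets,
    match_cost=lambda m, s: costs[(m, s)],
    delete_cost=lambda m: 2,
    add_cost=lambda s: 2,
)
```

With these numbers, m_a is delayed in pass 1 because its best
cost 3 is not below the deletion cost 2, m_b is matched with
s_y at cost 0, and m_a is then matched with s_x in pass 2
because 3 is below the combined delete and add cost 4.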



The pseudo code of the DMatch algorithm is shown in Figure
8. The full details of the algorithm, including discussion of
optimality, time complexity, etc., can be found in [15].

DMatch(n1, n2)
{
    // pass 1
    for each child m1 of n1
    {
        set S = m1's matching candidate set in pass 1;
        for each node s_i in S
            apply DMatch(m1, s_i);
        select a node s_k in S associated with the least cost c;
        if c < the cost of deleting m1
            match m1 and s_k;    // delay-match scheme
    }

    // pass 2
    for each unmatched child m1' of n1
    {
        set S' = the matching candidate set of m1' in pass 2;
        for each node s_i' in S'
            apply DMatch(m1', s_i');
        select a node s_k' in S' associated with the least cost c';
        if c' < the cost of deleting m1' + the cost of adding s_k'
            match m1' and s_k';    // must-match scheme
    }
}

Figure 8: Pseudo Code of DMatch Algorithm

Given matching root element types R1 and R2 of two DTDs, we
apply DMatch to the roots of R1's and R2's simplified trees.
The root match is propagated down the tree, and matches
between the name-match nodes of element types E1 and E2 are
identified. We then recursively apply the DMatch algorithm
to E1's and E2's simplified trees until no new name-match
node matches are generated.
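
This outer loop can be sketched as a simple worklist. The
sketch is an assumption-laden illustration: dmatch is a
placeholder for DMatch applied to two simplified element
trees, assumed here to return the node matches it found
together with the newly identified name-matched pairs of
element types, and the toy nesting table only mirrors the
company example.

```python
from collections import deque


def match_all(root_pair, dmatch):
    """Apply dmatch to name-matched element types until none are new."""
    seen = {root_pair}
    queue = deque([root_pair])
    all_matches = []
    while queue:
        e1, e2 = queue.popleft()
        matches, new_pairs = dmatch(e1, e2)
        all_matches.extend(matches)
        for pair in new_pairs:
            if pair not in seen:       # process each pair only once
                seen.add(pair)
                queue.append(pair)
    return all_matches


# Toy stand-in: each element type name-matches its nested types.
nested = {"company": ["personnel"], "personnel": ["person"],
          "person": ["name"], "name": []}


def toy_dmatch(e1, e2):
    return [(e1, e2)], [(c, c) for c in nested[e1]]


result = match_all(("company", "company"), toy_dmatch)
```

The seen set guarantees termination even when element types
refer to each other recursively, since every pair of element
types is expanded at most once.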

5.3 Example Illustrating Matching Process 

We now describe how the match discovery between DTD 1 and
DTD 2 depicted in Figures 3 and 4 would be done by our
system. We use the same settings as shown in the examples in
Section 4.2.

As shown in Figures 3 and 4, there are four pairs of
simplified element trees, i.e., company, personnel, person
and name. We apply DMatch to the simplified element trees of
the root type company first. We traverse <company>1's
children one by one. For <address>1, the matching candidate
set is empty since all the element nodes on the same level
(i.e., 2) are non-rename-able. For <cname>1, the matching
candidate set contains only <cname>2. Since they have the
same name, they are matched. Similarly, <personnel>1 is
matched against <personnel>2. For attribute <id>1, the
matching candidate set is empty. In pass 2, <address>1's
matching candidate set contains only a single element node
of DTD 2. We apply DMatch to the pair and derive a
transformation script composed of one operation that
relabels "address" to the name of that node. As illustrated
in Section 4, <address>1 will be mapped to that node.
Attribute <license>1's matching candidate set now contains
element <license>2, and with the given parameter setting
they will be matched. Now each of <company>1's children has
a partner. Hence we are done with matching element type
company.

We continue to apply DMatch to the simplified element trees
of each pair of element types matched by name, i.e.,
personnel, person and name. In this way, all matches between
them are discovered.

6. DISCUSSION

Once the relationship has been set up, an XSLT generator
generates an XSLT script for transforming the source XML
documents into the target format. We have implemented a
working system, XTra, and run experiments on it. The data
sets used for experimentation include both real-world data
collected from [20] and synthetic data. The results indicate
that our algorithm can satisfactorily discover acceptable
transformations. Due to space limitations, we do not discuss
them further here. The details can be found in [15].

7. CONCLUSION AND FUTURE WORK 

This work proposes an approach for automating the
transformation of XML documents. Specifically, we focus on
two fundamental problems. First, we address the problem of
how to automate the identification of semantic relationships
between XML-based documents. To this end, we propose a set
of DTD transformation operations that capture common
discrepancies between alternative DTD design behaviors for
modeling a given entity. We also define a cost model for
quantifying the quality of XML schema transformations.
Second, we have developed an algorithm that performs the
actual transformation of an XML-based document from a given
schema to a different, yet related, schema. Our work is
unique because we incorporate domain-specific
characteristics of the XML documents, such as domain
ontology, common transformation types, and specific DTD
modeling constructs (e.g., quantifiers and
type-constructors). This allows us to avoid the high level
of user interaction as well as the complexity required by
other approaches. We have implemented a prototype system
(XTra) and run experiments on both real and synthetic data
to verify the validity of our approach [15].

XML Schema [18] is emerging as a potential standard for
describing the structure of XML documents. In the future we
could investigate how to adapt our approach to exploit the
richer treatment of types offered by XML Schema as
additional hints of similarity.